IE510 Term Paper: Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz Algorithm
Abstract
In this paper, we mainly study the convergence properties of stochastic gradient descent (SGD) as described in Needell et al. [2]. The function to be minimized with SGD is assumed to be strongly convex, and its gradients are assumed to be Lipschitz continuous. First, we discuss the bound on the convergence of standard SGD obtained by Needell et al. [2], which improves on the earlier bound of Bach and Moulines [1]. Then, we show that this bound can be improved further if SGD is performed with importance (weighted) sampling instead of uniform sampling. Finally, we study two applications, logistic regression and the Kaczmarz algorithm, and demonstrate the faster convergence obtained from SGD with weighted (or partially weighted) sampling.

1 Stochastic Gradient Descent

In stochastic gradient descent (SGD), we minimize a function $F(w)$ using stochastic gradients in the update rule at each step. The stochastic gradients $g$ are such that their expectation is the gradient of $F(w)$, i.e. $\mathbb{E}[g] = \nabla F(w)$. In particular, if $F(w) = \mathbb{E}_{i \sim \mathcal{D}}[f_i(w)]$, then we may take $g = \nabla f_i(w)$. The update rule is then

$$w_{k+1} = w_k - \gamma \nabla f_{i_k}(w_k), \qquad (1)$$

where $\gamma$ is the step size and $i_k$ is drawn i.i.d. from some distribution $\mathcal{D}$. In this paper, we study the convergence of SGD on the function $F$ under the following assumptions:

1. Each $f_i$ is continuously differentiable and the gradient of $f_i$ has Lipschitz constant $L_i$, i.e.
$$\|\nabla f_i(w_1) - \nabla f_i(w_2)\|_2 \le L_i \|w_1 - w_2\|_2.$$

2. $F$ is strongly convex with parameter $\mu$, i.e.
$$F(w_1) \ge F(w_2) + \nabla F(w_2)^\top (w_1 - w_2) + \frac{\mu}{2}\|w_1 - w_2\|_2^2,$$
or equivalently
$$(w_1 - w_2)^\top \big(\nabla F(w_1) - \nabla F(w_2)\big) \ge \mu \|w_1 - w_2\|_2^2.$$
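To make the update rule (1) and the effect of the sampling distribution concrete, here is a minimal sketch on a least-squares objective $F(w) = \frac{1}{n}\sum_i \frac{1}{2}(a_i^\top w - b_i)^2$, where each $f_i$ has Lipschitz constant $L_i = \|a_i\|_2^2$. It compares uniform sampling against importance sampling with $p_i \propto L_i$, rescaling each sampled gradient by $1/(n p_i)$ so that it remains an unbiased estimate of $\nabla F(w)$. All data, step sizes, and iteration counts are illustrative choices, not values from [2].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
# Rows with widely varying norms, so the constants L_i = ||a_i||_2^2 differ a lot.
A = rng.standard_normal((n, d)) * rng.uniform(0.1, 5.0, size=(n, 1))
w_star = rng.standard_normal(d)
b = A @ w_star  # consistent system: every component gradient vanishes at w_star

# f_i(w) = (a_i^T w - b_i)^2 / 2, so grad f_i(w) = (a_i^T w - b_i) a_i and L_i = ||a_i||_2^2.
L = np.sum(A**2, axis=1)

def sgd(probs, gamma, steps=30000):
    """Run update (1) with i_k ~ probs; each gradient is rescaled by 1/(n p_i)
    so that E[g] = grad F(w) still holds under the biased sampling."""
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.choice(n, p=probs)
        g = (A[i] @ w - b[i]) * A[i] / (n * probs[i])
        w -= gamma * g
    return np.linalg.norm(w - w_star)

uniform = np.full(n, 1.0 / n)
weighted = L / L.sum()  # importance sampling with p_i proportional to L_i

print("uniform :", sgd(uniform, gamma=1.0 / L.max()))  # step limited by sup_i L_i
print("weighted:", sgd(weighted, gamma=n / L.sum()))   # step limited only by the mean of the L_i
```

Note that with the weighted distribution and step size $1/\overline{L}$ (where $\overline{L} = \frac{1}{n}\sum_i L_i$), each update reduces to an exact projection onto the hyperplane $\{w : a_i^\top w = b_i\}$, which is precisely the randomized Kaczmarz step discussed in the references below.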
Similar References
Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm
We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning (L/μ) (where L is a bound on the smoothness and μ on the strong convexity) to a linear dependence on L/μ. Furthermore, we show how reweighting the sampling distribution (i.e. importance samplin...
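To make the comparison concrete: up to the precise constants (see [2] for the exact statement), the guarantee referred to here has the form

$$\mathbb{E}\|w_k - w_\star\|_2^2 \;\le\; \Big(1 - 2\gamma\mu\big(1 - \gamma \sup_i L_i\big)\Big)^k \|w_0 - w_\star\|_2^2 \;+\; \frac{\gamma\,\sigma^2}{\mu\big(1 - \gamma \sup_i L_i\big)},$$

where $\sigma^2 = \mathbb{E}_i\|\nabla f_i(w_\star)\|_2^2$ is the gradient noise at the minimizer $w_\star$. Reweighting the sampling distribution so that $p_i \propto L_i$ (with gradients rescaled by $1/(n p_i)$ to stay unbiased) lets $\sup_i L_i$ be replaced by the average $\overline{L} = \frac{1}{n}\sum_i L_i$, which is where the improvement from worst-case to average conditioning comes from.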
Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress toward theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least mean squares...
Toward a Noncommutative Arithmetic-geometric Mean Inequality: Conjectures, Case-studies, and Consequences
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress toward theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least mean squares...
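As an aside, the with- versus without-replacement difference these two papers study is easy to probe empirically. A minimal sketch on a least mean squares problem follows; all data and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true  # consistent least mean squares problem

def lms(next_order, gamma=0.02, epochs=100):
    """One pass of n component steps per epoch; next_order decides how
    the n indices of each pass are drawn."""
    x = np.zeros(d)
    for _ in range(epochs):
        for i in next_order():
            x -= gamma * (A[i] @ x - b[i]) * A[i]  # grad of (a_i^T x - b_i)^2 / 2
    return np.linalg.norm(x - x_true)

with_replacement = lambda: rng.integers(0, n, size=n)  # i.i.d. uniform draws
without_replacement = lambda: rng.permutation(n)       # random reshuffling each pass

print("with replacement:   ", lms(with_replacement))
print("without replacement:", lms(without_replacement))
```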
Rows vs Columns for Linear Systems of Equations - Randomized Kaczmarz or Coordinate Descent?
This paper is about randomized iterative algorithms for solving a linear system of equations Xβ = y in different settings. Recent interest in the topic was reignited when Strohmer and Vershynin (2009) proved the linear convergence rate of a Randomized Kaczmarz (RK) algorithm that works on the rows of X (data points). Following that, Leventhal and Lewis (2010) proved the linear convergence of a ...
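For concreteness, here is a minimal sketch of the row-action RK update that Strohmer and Vershynin analyze, with row $i$ sampled with probability proportional to its squared norm; the test problem is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 20
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true  # consistent overdetermined system X beta = y

row_norms_sq = np.sum(X**2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()  # row i chosen with prob ||x_i||^2 / ||X||_F^2

beta = np.zeros(d)
for _ in range(5000):
    i = rng.choice(n, p=probs)
    # Project the iterate onto the hyperplane {z : x_i^T z = y_i} defined by row i.
    beta += (y[i] - X[i] @ beta) / row_norms_sq[i] * X[i]

print("error:", np.linalg.norm(beta - beta_true))
```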
On the Relation Between the Randomized Extended Kaczmarz Algorithm and Coordinate Descent
In this note we compare the randomized extended Kaczmarz (EK) algorithm and randomized coordinate descent (CD) for solving the full-rank overdetermined linear least-squares problem, and prove that CD needs fewer operations to satisfy the same residual-related termination criteria. For general least-squares problems, we show that running CD first to compute the residual and then standard K...